ABSTRACT



Digital image signal interpretation is the process of under- 
standing the content of an image through the identification 
of significant objects or regions in the image and analysing 
their spatial arrangement. Traditionally the task of image 
interpretation required human analysis. This is expensive
and time consuming; consequently, considerable research
has been directed towards constructing automated image
interpretation systems. A method of interpreting a digital
video signal is disclosed, whereby the digital video signal has
contextual data. The method comprises the steps of, firstly,
segmenting the digital video signal into one or more video
segments, each segment having a corresponding portion of
the contextual data, and secondly, analysing each video segment
to provide a graph at one or more temporal instances in the
respective video segment dependent upon the corresponding
portion of the contextual data.

44 Claims, 11 Drawing Sheets 
AUTOMATED VIDEO INTERPRETATION
SYSTEM 

FIELD OF INVENTION 

The present invention relates to the statistical analysis of
digital video signals, and in particular to the statistical 
analysis of digital video signals for automated content 
interpretation in terms of semantic labels. The labels can be 
subsequently used as a basis for tasks such as content-based 
retrieval and video abstract generation. 

DESCRIPTION OF THE PRIOR ART 

Digital video is generally assumed to be a signal repre- 
senting the time evolution of a visual scene. This signal is 
typically encoded along with associated audio information 
(eg., in the MPEG-2 audiovisual coding format). In some
cases information about the scene, or the capture of the 
scene, might also be encoded with the video and audio 
signals. The digital video is typically represented by a 
sequence of still digital images, or frames, where each 
digital image usually consists of a set of pixel intensities for 
a multiplicity of colour channels (eg., R, G, B). This 
representation is due, in a large part, to the grid-based 
manner in which visual scenes are sensed. 

The visual, and any associated audio signals, are often 
mutually correlated in the sense that information about the 
content of the visual signal can be found in the audio signal 
and vice-versa. This correlation is explicitly recognised in 
more recent digital audiovisual coding formats, such as 
MPEG-4, where the units of coding are audiovisual objects 
having spatial and temporal localisation in a scene. Although 
this representation of audiovisual information is more 
attuned to the usage of the digital material, the visual 
component of natural scenes is still typically captured using
grid-based sensing techniques (ie., digital images are sensed 
at a frame rate defined by the capture device). Thus the 
process of digital video interpretation remains typically 
based on that of digital image interpretation and is usually 
considered in isolation from the associated audio informa- 
tion. 

Digital image signal interpretation is the process of under- 
standing the content of an image through the identification 
of significant objects or regions in the image and analysing 
their spatial arrangement. Traditionally the task of image 
interpretation required human analysis. This is expensive
and time consuming; consequently, considerable research
has been directed towards constructing automated image 
interpretation systems. 

Most existing image interpretation systems involve low- 
level and high-level processing. Typically, low-level pro- 
cessing involves the transformation of an image from an 
array of pixel intensities to a set of spatially related image 
primitives, such as edges and regions. Various features can 
then be extracted from the primitives (eg., average pixel 
intensities). In high-level processing image domain knowl- 
edge and feature measurements are used to assign object or 
region labels, or interpretations, to the primitives and hence 
construct a description as to "what is present in the image".

Early attempts at image interpretation were based on 
classifying isolated primitives into a finite number of object 
classes according to their feature measurements. The suc- 
cess of this approach was limited by the erroneous or 
incomplete results that often result from low-level process- 
ing and feature measurement errors that result from the 
presence of noise in the image. Most recent techniques 
incorporate spatial constraints in the high-level processing.
This means that ambiguous regions or objects can often be 
recognised as the result of successful recognition of neigh- 
bouring regions or objects. 

More recently, the spatial dependence of region labels for
an image has been modelled using statistical methods, such
as Markov Random Fields (MRFs). The main advantage of
the MRF model is that it provides a general and natural
model for the interaction between spatially related random
variables, and there are relatively flexible optimisation
algorithms that can be used to find the (globally) optimal
realisation of the field. Typically the MRF is defined on a
graph of segmented regions, commonly called a Region
Adjacency Graph (RAG). The segmented regions can be
generated by one of many available region-based image
segmentation methods. The MRF model provides a powerful
mechanism for incorporating knowledge about the spatial
dependence of semantic labels with the dependence of
the labels on measurements (low-level features) from the
image.

Digital audio signal interpretation is the process of under- 
standing the content of an audio signal through the identi- 
fication of words/phrases, or key sounds, and analysing their 
temporal arrangement. In general, investigations into digital 
audio analysis have concentrated on speech recognition 
because of the large number of potential applications for 
resultant technology, eg., natural language interfaces for 
computers and other electronic devices. 

Hidden Markov Models are widely used for continuous 
speech recognition because of their inherent ability to incor-
porate the sequential and statistical character of a digital
speech signal. They provide a probabilistic framework for
the modelling of a time-varying process in which units of
speech (phonemes, or in some cases words) are represented 
as a time sequence through a set of states. Estimation of the 
transition probabilities between the states requires the analy- 
sis of a set of example audio signals for the unit of speech 
(ie., a training set). If the recognition process is required to 
be speaker independent then the training set must contain 
example audio signals from a range of speakers. 

SUMMARY OF THE INVENTION 

According to one aspect of the present invention there is 
provided a method of interpreting a digital video signal, 
wherein said digital video signal has contextual data, said 
method comprising the steps of: 

segmenting said digital video signal into one or more 
video segments, each segment having a corresponding 
portion of said contextual data; and 
analysing each video segment to provide a graph at one or 
more temporal instances in the respective video seg- 
ment dependent upon said corresponding portion of 
said contextual data. 

According to another aspect of the present invention there
is provided an apparatus for interpreting a digital video
signal, wherein said digital video signal has contextual data,
said apparatus comprising:

means for segmenting said digital video signal into one or
more video segments, each segment having a corresponding
portion of said contextual data; and
means for analysing each video segment to provide an
analysis token for one or more regions contained in the
respective video segment dependent upon said corresponding
portion of said contextual data.

According to still another aspect of the present invention 
there is provided a computer program product comprising a 
computer readable medium having recorded thereon a computer
program for interpreting a digital video signal, wherein
said digital video signal has contextual data, said computer
program product comprising:

means for segmenting said digital video signal into one or
more video segments, each segment having a corresponding
portion of said contextual data; and
means for analysing each video segment to provide an
analysis token for one or more regions contained in the
respective video segment dependent upon said corresponding
portion of said contextual data.

BRIEF DESCRIPTION OF THE DRAWINGS

The embodiments of the invention are described hereinafter
with reference to the drawings, in which:

FIG. 1 is a block diagram of a digital video interpretation
system according to the preferred embodiment;

FIG. 2 illustrates the video segment analyser of FIG. 1
according to the preferred embodiment;

FIGS. 3A and 3B illustrate a representative segmented
image and a corresponding region adjacency graph (RAG),
respectively, in accordance with the embodiments of the
invention;

FIG. 4 illustrates a frame event analyser of FIG. 2 and
having a single application domain;

FIG. 5 illustrates an alternative frame event analyser of
FIG. 2 and having multiple application domains;

FIG. 6 illustrates the selection of a temporal region of
interest (ROI) for a particular analysis event;

FIG. 7 illustrates a preferred contextual analyser for use
in the frame event analyser of FIG. 4 or FIG. 5;

FIG. 8 illustrates cliques associated with the RAG of FIG.
3B;

FIG. 9 is a block diagram of a representative computer for
use with a digital video source, with which the embodiments
of the invention may be practiced;

FIG. 10 is a block diagram of a representative digital
video source, with which the embodiments of the invention
may be practiced; and

FIG. 11 illustrates the video segment analyser of FIG. 1
according to an alternate embodiment, which is optionally
integrated in a digital video coding system.

DETAILED DESCRIPTION

1. Overview

The present invention relates to a method, apparatus and
system for automatically generating an abbreviated high-level
description of a digital video signal that captures
(important) semantic content of the digital video signal in
spatial and time domains. Such descriptions can subsequently
be used for numerous purposes including content-based
retrieval, browsing of video sequences or digital video
abstraction.

A digital video signal is taken to be the visual signal
recorded in a video capture device. The signal would
generally, but not necessarily, be generated from a
two-dimensional array of sensors (eg., CCD array) at a specified
sampling rate with each sample being represented by a
(video) frame. The analysis of the spatial and temporal
content of this signal can benefit from a range of contextual
information. In some cases this contextual information is
implicit within the digital video signal (eg., motion), and in
other cases the information may be available from other
associated sources (eg., the associated audio signal, recorded
camera parameters, other sensors apart from the generally
used visual spectrum sensors). The extent of the available
contextual information is much greater than that available to
a still image analysis process and has the additional property
of time evolution.

The time evolution of a digital image signal in a video
signal can be used to improve the success of the image
interpretation process of a digital video signal. For example,
motion information can be used to assist in the detection of
moving objects, such as people, in the digitally recorded
scene. Motion can also be used to selectively group regions
in an image frame as being part of the background for a
scene.

The process of interpreting, or understanding, digital
video signals may also benefit from the identification of
audio elements (speech and non-speech) in the associated
audio signal. For example, words identified in an audio
signal may be able to assist in the interpretation process.
Also the audio signal associated with a wildlife documentary
may contain the sounds made by various animals that may
help identify the content of the video signal.

The embodiments of the present invention extend the use
of probabilistic models, such as MRFs, to digital video
signal interpretation. This involves the repeated use of a
probabilistic model through a video sequence to integrate
information, in the form of measurements and knowledge,
from different sources (eg., video frames, audio signal, etc.)
in a single optimisation procedure.

In the embodiments of the invention, a high-level description
is generated at selected analysis events throughout the
digital video signal. The high-level description is based on
an assignment of various semantic labels to various regions
that are apparent at the analysis event. At each analysis
event, the video frame centred on the analysis event is
automatically spatially segmented into homogeneous
regions. These regions and their spatial adjacency properties
are represented by a Region Adjacency Graph (RAG). The
probabilistic model is then applied to the RAG. The model
incorporates feature measurements from the regions of the
frame, contextual information from a Region of Interest
(ROI) around the frame, and prior knowledge about the
various semantic labels that could be associated with the
regions of the RAG. These semantic labels (eg., "person",
"sky", "water", "foliage", etc.) are taken from a list which
has been typically constructed for an appropriate application
domain (eg., outdoor scenes, weddings, urban scenes, etc).

At each analysis event, the contextual information is used
to bias the prior probabilities of the semantic labels
(hereinafter labels) in a selected appropriate application
domain. The analysis performed at a given analysis event
also depends on the previous analysis events. This dependence
is typically greater if two analysis events are close
together in the time domain. For example, within a video
segment, it is likely, but not exclusively, that labels selected
for regions at previous recent analysis events are more
probable than labels that have not been selected in the
description of the current section of digital video.

The digital video interpretation system can operate with a
single application domain or multiple application domains.
If multiple application domains are being used then contextual
information can be used to determine the most probable
application domain. The application domains may be narrow
(ie., few labels) or broad (ie., many possible labels). Narrow
application domains would typically be used if very specific
and highly reliable region labelling is required. For example,
in a security application, it may be desirable to be able to
identify regions that are associated with people and cars, but
the identification of these objects might be required with
high reliability.

In the following detailed description of the embodiments,
numerous specific details such as video encoding
techniques, sensor types, etc., are set forth to provide a more
thorough description. It will be apparent, however, to one
skilled in the art that the invention may be practiced without
these specific details. In other instances, well-known
features, such as video formats, audio formats, etc., have not
been described in detail so as not to obscure the invention.

2. Digital Video Interpretation System of the
Preferred Embodiment

FIG. 1 illustrates a probabilistic digital video interpretation
system 160 according to the preferred embodiment of
the invention. The digital video interpretation system 160
comprises a video segmenter 120 and a video segment
analyser 140, and processes digital video source output 110
which has been generated from a digital video source 100.
Preferably the digital video source is a digital video camera.
The video segmenter 120 is coupled between the digital
video source 100 and the video segment analyser 140.

The digital video interpretation system 160 may optionally
be internally or externally implemented in relation to the
digital video source 100. When the digital video interpretation
system 160 is located inside the digital video source 100
(eg., a digital camera) the interpretation system 160 can
readily make use of additional camera information without
having to explicitly store this additional information with the
video and audio signals that typically constitute the audiovisual
signal. For example, camera motion information in a
digital video camera may be used to assist motion analysis
of the digital video signal 110A. Further, operator-gaze
location could provide information about where key objects
in a scene are located, and focal information could be used
to generate depth information, which could be used to
generate a depth axis for the RAG as shown in FIG. 3B.

The input to the digital video interpretation system 160 is
the digital video source output 110 that is captured using a
device such as a digital video camera. The digital video
source output 110 is usually composed of a digital video
signal 110A and a digital audio signal 110B. Additional
information 110C may also be available depending on the
capture device [...] (such as focus information, etc.) and
other sensor information (eg., infrared sensing).

In the digital video interpretation system 160, the video
segmenter 120 segments the digital video signal into temporal
video segments or shots 130 provided at its output. The
resulting video segments 130 produced by the video segmenter
120 are provided as input to the video segment
analyser 140. The video segment analyser 140 produces a
sequence of labelled RAGs for each video segment 130.

The video segment analyser 140 of FIG. 1 generates and
then attempts to optimally label regions in a sequence of
RAGs using one or more appropriate application domains.
The resulting sequence of labelled RAGs 150 represents a
description of the content of the digital video signal 110A
and is referred to hereinafter as metadata.

The embodiments of the invention can be implemented
externally in relation to a digital video source as indicated by
the computer 900 depicted in FIG. 9. Alternatively the
digital video interpretation system may be implemented
internally within a digital video source 1000 as depicted in
FIG. 10.

With reference to FIG. 9, the general purpose computer is
coupled to a remote digital video source 1000. The video
interpretation system is implemented as software recorded
on a computer readable medium that can be loaded into and
carried out by the computer. The computer 900 comprises a
computer module 902, video display monitor 904, and input
devices 920, 922. The computer module 902 itself comprises
at least one central processing unit 912, a memory unit 916
which typically includes random access memory (RAM) and
read only memory (ROM), input output (I/O) interfaces 906,
908, 914 including a video interface 906. The I/O interface
908 enables the digital video source 1000 to be coupled with
the computer module 902 and a pointing device such as
mouse 922. The storage device 910 may include one or more
of the following devices: a floppy disc, a hard disc drive, a
CD-ROM drive, a magnetic tape drive, or similar non-volatile
storage device known to those skilled in the art. The
components 906 to 916 of the computer 902 typically
communicate via an interconnected bus 918 and in a manner
which results in a usual mode of operation of a computer
system 900 known to those in the relevant art. Examples of
computer systems on which the embodiments of the invention
can be practised include IBM PC/ATs and compatibles,
Macintosh Computers, Sun Sparcstations, or any of a number
of computer systems well known to those in the art. The
digital video source 1000 is preferably a digital camera that
is capable of recording a video signal into storage (eg.,
[...]) and additional information, for instance, infrared data
that is associated with the video signal. The digital video
signal and any associated additional (contextual) information
may be downloaded to the computer 900 where the
interpretation and labelling processes are performed in
accordance with the embodiments of the invention.

Alternatively, the embodiment of the invention may be
practiced internally within a digital video source 1000, which
is preferably a digital video camera. The digital video
source 1000 comprises video capture unit 1002 (eg., including
a charge couple device) for capturing images and having
sensors and/or mechanisms for providing focal data and
other settings of the video capture unit 1002. The digital
video source 1000 may also include sensors 1004 for
capturing audio information, ambient and/or environmental
data, positioning data (eg., GPS information) etc. The sensors
1004 and the video capture unit 1002 are connected to
a processing unit of the digital video source 1000,
with which the embodiments of the invention may be
practiced. The processing unit 1006 is coupled to memory
1008, communications port 1010, and a user interface unit
1012. The user interface unit 1012 allows the operator of the
video source 1000 to specify numerous settings in operational
modes of the digital video source 1000. For example,
the digital video source operator may select different
application domains (eg., Outdoor Scenes, Urban Scenes,
Wedding Scenes, etc.) to use with the interpretation system.
Application domains could be downloaded to the capture
device, electronically or via a possible wireless link. The
memory 1008 may comprise random access memory, read
only memory, and/or non-volatile memory storage devices.
Both data and processing instructions for operating the
processing unit may be stored in the memory 1008. The
communications port 1010 permits communications
between the digital video source 1000 and external devices
such as a computer 900 of FIG. 9. The communications port
1010 is capable of transmitting and receiving data and
instructions to both the memory 1008 and the processing
unit 1006.

3. Video Segmenter of the Preferred Embodiment

The video segmenter 120 segments the digital video
signal into temporal video segments or shots 130. Information
regarding the motion of pixels in a frame (implicit
within the digital video signal 110A) and/or any other
supporting information that may be available in the digital
audio signal 110B or other information 110C can be used to
assist the segmentation by the video segmenter 120. For 
example, if information about when an operator started and 
stopped recording was available then the video segmentation 
could be based on this information. Known video segmen- 
tation techniques can be implemented in the video seg- 
menter without departing from the scope and spirit of the 
invention. 
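
The text above leaves the choice of segmentation technique open. As a purely illustrative sketch, and not the method of the embodiment, one well-known approach places a shot boundary wherever the grey-level histogram changes sharply between consecutive frames. The frame representation, the names `histogram` and `segment_into_shots`, the bin count and the threshold are all assumptions made for this example.

```python
from typing import List, Sequence

# Assumption: a frame is a flat sequence of 8-bit grey levels (0..255).
Frame = Sequence[int]


def histogram(frame: Frame, bins: int = 16) -> List[float]:
    """Return the normalised grey-level histogram of one frame."""
    counts = [0] * bins
    for value in frame:
        counts[min(value * bins // 256, bins - 1)] += 1
    total = float(len(frame)) or 1.0  # avoid division by zero on empty frames
    return [c / total for c in counts]


def segment_into_shots(frames: Sequence[Frame], threshold: float = 0.5) -> List[range]:
    """Split a frame sequence into shots (ranges of frame indices).

    A new shot starts whenever the L1 distance between the histograms of
    two consecutive frames exceeds `threshold` (hypothetical value).
    """
    if not frames:
        return []
    shots: List[range] = []
    start = 0
    previous = histogram(frames[0])
    for index in range(1, len(frames)):
        current = histogram(frames[index])
        # The L1 distance between two normalised histograms lies in [0, 2].
        distance = sum(abs(a - b) for a, b in zip(previous, current))
        if distance > threshold:
            shots.append(range(start, index))
            start = index
        previous = current
    shots.append(range(start, len(frames)))
    return shots
```

Where recording start and stop times are available, as suggested above, those times could replace the histogram test entirely; the sketch only covers the case where boundaries must be inferred from the video signal itself.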

4. Video Segment Analyser of the Preferred 
Embodiment 

The video segment analyser 140 generates a sequence of 
labelled RAGs 150 for each video segment 130. The RAGs 
are preferably three-dimensional with range information 
being obtained from the digital video source 100 shown in 
FIG. 1. Each RAG consists of a set of disjoint regions and
a set of edges connecting the regions. Regions that are 
located in the same X-Y plane are assumed to be coplanar. 
In contrast regions located in different Z planes are assumed 
to correspond to regions at different depths in the depicted 
scene. In general, the use of the depth axis (e.g., Z-axis) in 
the RAG depends on the availability of information to 
indicate that a particular region is located at a different depth 
than one or more other regions. For example, the depth axis 
can be utilised in a digital video interpretation system 160 in 
which focal information, or depth information, is available 
to determine the depth of particular regions. However, the 
video segment analyser 140 can generate a sequence of 
labelled RAGs 150 without the aid of depth information,
treating substantially all disjoint regions as being
coplanar.

FIG. 2 shows the video segment analyser 140 according
to the preferred embodiment. In block 200 an initial frame
in a video segment 130 from the video segmenter 120 of
FIG. 1 is selected for analysis. A frame event analyser 202
receives the selected frame and temporal regions of interest
(ROIs) for that frame, as described hereinafter with reference
to FIG. 6, and generates a labelled RAG. Next, in block
204, the generated RAG is stored and a decision block 206
is used to determine if the end of the video segment 130 has
been reached. If the end of the video segment is reached, that
is the decision block 206 returns true (yes), video segment
processing terminates in block 208. Otherwise, when the
decision block 206 returns false (no), a next frame to be
analysed in the video segment 130 is retrieved and processing
returns to the frame event analyser 202. Preferably, each frame
of a video segment 130 is selected and analysed in real-time.
However, in practice the selection of frames to be analysed
depends upon the application in which the digital video
interpretation system is used. For example, real-time
performance may not be possible when analysing each frame in
some devices comprising the digital video interpretation
system, in which case only predetermined frames of a video
segment are selected for analysis.
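
The control flow of FIG. 2 can be sketched as a simple loop. This is a minimal illustration under stated assumptions: the function name `analyse_video_segment`, the `analyse_frame_event` callback (standing in for the frame event analyser 202) and the `frame_step` parameter (standing in for the selection of predetermined frames) are hypothetical names that do not appear in the source.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")
LabelledRAG = TypeVar("LabelledRAG")


def analyse_video_segment(
    frames: Sequence[Frame],
    analyse_frame_event: Callable[[Frame, int], LabelledRAG],
    frame_step: int = 1,
) -> List[LabelledRAG]:
    """Produce one labelled RAG per analysed frame of a video segment.

    `analyse_frame_event` receives the frame and its index; the index lets
    the callback locate the temporal ROIs around the frame (an assumption
    about how the ROIs would be obtained).
    """
    if frame_step < 1:
        raise ValueError("frame_step must be at least 1")
    labelled_rags: List[LabelledRAG] = []
    index = 0  # block 200: select the initial frame
    # Decision block 206: stop at the end of the segment. The check is made
    # before analysis here so that an empty segment is handled gracefully.
    while index < len(frames):
        rag = analyse_frame_event(frames[index], index)  # frame event analyser 202
        labelled_rags.append(rag)  # block 204: store the generated RAG
        index += frame_step  # retrieve the next frame to be analysed
    return labelled_rags  # block 208: processing terminates
```

With `frame_step=1` every frame is analysed; a larger step corresponds to the case where real-time analysis of every frame is not possible.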

FIG. 3B is an example of a three-dimensional RAG 310
for the spatially segmented frame 300 shown in FIG. 3A.
The spatially segmented frame 300 contains nine regions
named R1 to R9. The region R1 contains sky. The regions
R2, R3, and R9 contain land and the region R8 is a road. The
region R4 is a house-like structure, and the regions R5 and
R6 are projecting structures in the house. To indicate depth
in FIG. 3A, borders of regions are indicated with several
thicknesses. In particular, the thickness of the respective
borders indicates the frontedness of depth in the Z axis. The
RAG 310 indicates connected edges of regions R1 to R9 in
the segmented frame 300. The regions R1, R2, R3, R7, R8
and R9 are all located at the same approximate depth (as
indicated by solid edge lines) in the RAG 310 but at different
X-Y positions. The region R1 is sequentially connected to
regions R2, R8, R9 on the one hand and to regions R3 and
R7 on the other hand. In turn, the region R4 has an edge with
regions R2, R3, R7 and R8, but has a different depth as
indicated by dashed or broken edge lines. Finally, the
regions R5 and R6 share an edge with the region R4 but at
a different, parallel depth indicated by dotted edge lines.
Thus, the dashed and dotted lines cross different Z-planes.
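
A RAG of this kind can be held in a small adjacency structure. The sketch below is illustrative only: the class and function names are hypothetical, and it assumes that "sequentially connected" means the chains R1-R2-R8-R9 and R1-R3-R7. The edge style ("solid", "dashed", "dotted") records the depth relationship described for FIG. 3B.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Set


@dataclass
class RegionAdjacencyGraph:
    """Regions plus undirected edges tagged with a depth-related line style."""

    regions: Set[str] = field(default_factory=set)
    edges: Dict[FrozenSet[str], str] = field(default_factory=dict)

    def add_edge(self, a: str, b: str, style: str = "solid") -> None:
        self.regions.update((a, b))
        self.edges[frozenset((a, b))] = style

    def neighbours(self, region: str) -> Set[str]:
        """Return every region that shares an edge with `region`."""
        return {r for edge in self.edges if region in edge for r in edge if r != region}


def build_fig_3b_rag() -> RegionAdjacencyGraph:
    """Build the RAG described for FIG. 3B (chain reading is an assumption)."""
    rag = RegionAdjacencyGraph()
    # Solid edges: regions at the same approximate depth.
    for chain in (["R1", "R2", "R8", "R9"], ["R1", "R3", "R7"]):
        for a, b in zip(chain, chain[1:]):
            rag.add_edge(a, b, "solid")
    # Dashed edges: R4 lies at a different depth from its neighbours.
    for other in ("R2", "R3", "R7", "R8"):
        rag.add_edge("R4", other, "dashed")
    # Dotted edges: R5 and R6 lie at a further, parallel depth.
    for other in ("R5", "R6"):
        rag.add_edge("R4", other, "dotted")
    return rag
```

When no depth information is available, every edge would simply carry the same style, which corresponds to treating all regions as coplanar.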

5. Frame Event Analyser 

The functionality of the frame event analyser 202 of FIG. 
2 is described in greater detail with reference to FIG. 4 and 
FIG. 5. The steps of the frame event analyser shown in FIG.
4 use a single application domain (eg., outdoor scenes).
Such an application domain could contain the knowledge 
and functionality to label frame regions, such as sky, water, 
foliage, grass, road, people, etc. 

In FIG. 4, the current frame and ROIs 400 for each of the 
information sources (eg., the digital video signal, digital 
audio signal, etc.) are provided to a contextual analyser 410 
that uses the ROIs 400. In addition to the contextual analyser 
410, the frame event analyser 202 of FIG. 2 comprises a 
frame segmenter 450, an adjustment unit 430 for adjusting 
prior probabilities of labels in an application, and a region 
analyser 470. 

The contextual information 400 available in the ROIs is
analysed by a contextual analyser 410. Since there are
multiple sources of contextual information, the contextual
analyser 410 typically includes more than one contextual
analysing unit.

A more detailed illustration of the contextual analyser
410 is provided in FIG. 7 in relation to the adjusting unit
430 and the application domain 440. The contextual analyser
410 shown in FIG. 7 receives contextual information 400 for
the frame event, which preferably comprises an audio ROI,
a motion analysis ROI and/or an infrared spectral ROI. The
contextual analyser 410 itself may include an audio analysing
unit 710, a motion analysing unit 720, and an infrared
analysing unit 730. The outputs produced by the contextual
analyser 410 are used by the adjustment unit 430 to alter the
prior probabilities of the labels in the application domain
440 used by the region analyser 470 of FIG. 4. The audio
analysing unit 710 may achieve this result by recognising
key words or phrases in the digital audio signal located in the
audio signal ROI and then checking to see whether these key
words/phrases suggest that any particular label(s) are more
likely to occur in the frame than other labels. Other contextual
analyser units (eg., 720, 730) may directly alter the prior
probabilities of the labels.
In the preferred embodiment having a frame event analy- 

60 ser 210 with a single application domain 440, the prior 
probabilities for the labels may be adjusted by the adjusting 
unit 430 on the basis of a list of key words/phrases 420 being 
stored for each label in the application domain 440 with a 
prior probability-weighting factor for each key word/phrase. 

65 The higher the probability-weighting factor is, the more 
likely that a region described by that label exists in the 
frame. Other contextual analysis results may also be provided to the adjusting unit 430, in addition to or in place of key words 420.

The frame segmenter 450 segments a frame into homogeneous regions using a region-based segmentation method. Typically, the segmentation method uses contextual information extracted from the ROIs of the different information sources (eg., video 110A and audio 110B) to assist the segmentation process. For example, motion vectors can assist in the differentiation of moving objects from a background. If focal information is available then this information can be used to estimate distances and therefore differentiate between different object or region planes in the frame. The result of the segmentation process carried out by the frame segmenter 450 is a RAG 460, such as that shown in FIG. 3, which is provided as input to the region analyser 470. Preferably this RAG is three-dimensional. The other input to the region analyser 470 is the application domain 440 in which the prior probabilities of the labels may have been adjusted depending on the contextual information.

The probabilistic-model-based region analyser 470 optimally labels the regions in the RAG using the appropriate application domain 440. The resulting labelled RAG represents a description of the content of the frame, or metadata, that can be used for tasks such as content-based retrieval. Preferably, the region analyser uses a Markov Random Field (MRF) model to produce the labelled RAG. The MRF model is described in detail in the following paragraphs.

Turning to FIG. 5 there is shown a frame event analyser which can be described in a substantially similar manner to that of FIG. 4, excepting that the frame event analyser now has multiple application domains. In a frame event analyser having multiple application domains (ie., as depicted in FIG. 5), each application domain could contain a list of key words/phrases and the role of a selecting unit 530 can include the selection of an application domain to be used in the analysis. Thus, the selecting unit 530 selects a most probable application domain and preferably adjusts prior probabilities of labels in the selected domain.

Referring to FIG. 6, there is shown a time-line 600 for a video segment. A current frame 601 of the video segment is analysed with reference to one or more regions of interest (ROIs) 602 extracted from available contextual information 603, audio information 604 (signal) and video information 605 (signal).

Temporal boundaries of the ROIs may vary with the type of contextual information (see FIG. 6). For example, contextual information such as camera parameters may extend over a large temporal period, maybe the entire video segment. In contrast, the ROI for the video signal may be much shorter, maybe several frames before and after the frame being currently analysed. ROIs are not necessarily centred over the current frame, as shown in FIG. 6. They can, for example, just include previous frames.

Mathematically a RAG is defined to be a graph G which contains a set R of disjoint regions and a set E of edges connecting the regions; G = {R, E}. Video frame interpretation seeks to optimally label the regions in G. If an application domain consists of a set of p labels, $L = \{L_1, L_2, L_3, \ldots, L_p\}$, with prior probabilities, $Pr_L = \{Pr_{L_1}, Pr_{L_2}, Pr_{L_3}, \ldots, Pr_{L_p}\}$, which have been biased by an analysis of the contextual information, then the interpretation process can be viewed as one of estimating the most probable set of labels on the graph G.

If the graph G consists of N disjoint regions, then let $X = \{X_1, X_2, X_3, \ldots, X_N\}$ be a family of random variables on the RAG. That is, X is a random field, where $X_i$ is the random variable associated with $R_i$. The realisation $x_i$ of $X_i$ is a member of the set of labels, L. A neighbourhood system $\Gamma$ on G is denoted by:

$$\Gamma = \{ n(R_i);\ 1 \le i \le N \} \qquad (1)$$

where $n(R_i)$ is a subset of R that contains neighbours of $R_i$. Preferably, a neighbourhood system for a region $R_i$ is that region and all other regions which have some common boundary with $R_i$.

Further, $\Omega$ is the set of all possible labelling configurations, and $\omega$ denotes a configuration in $\Omega$:

$$\Omega = \{ \omega = \{x_1, x_2, x_3, \ldots, x_N\} :\ x_i \in L,\ 1 \le i \le N \} \qquad (2)$$

Then X is a MRF with respect to the neighbourhood system $\Gamma$ if:

$$P(X = \omega) > 0 \ \text{for all realisations of } X;$$
$$P(X_i = x_i \mid X_j = x_j,\ R_j \ne R_i) = P(X_i = x_i \mid X_j = x_j,\ R_j \in n(R_i)). \qquad (3)$$

An important feature of the MRF is that its joint probability density function, $P(X = \omega)$, has a Gibbs distribution:

$$P(X = \omega) = Z^{-1} \exp[-U(\omega)/T], \qquad (4)$$

where T is the temperature, and $U(\omega)$ is the Gibbs energy function. The partition function Z is given as follows:

$$Z = \sum_{\omega} \exp[-U(\omega)/T]. \qquad (5)$$

The energy function can be expressed using the concept of "cliques". A clique c, associated with the graph G, is a subset of R such that it contains either a single region or several regions that are all neighbours of each other. The cliques for each region in the RAG depicted in FIG. 3B are listed in FIG. 8. The region R1 has associated cliques {R1}, {R1, R2}, and {R1, R3}, for example.

The set of cliques for the graph G is denoted C. A clique function $V_c$ is a function with the property that $V_c(\omega)$ depends on the $x_i$ values (ie., labels) for which $i \in c$. Since a family of clique functions is called a potential, $U(\omega)$ can be obtained by summing the clique functions for G:

$$U(\omega) = \sum_{c \in C} V_c(\omega). \qquad (6)$$

Region-based feature measurements obtained from the frame and prior knowledge are incorporated into the clique functions $V_c$. The likelihood of a particular region label $L_i$ given a set of region feature measurements can be estimated using various methods which could involve the use of a training set (eg., neural networks) or may be based on empirical knowledge. Similarly, prior knowledge can be incorporated into the clique functions $V_c$ in the form of constraints that may or may not be measurement-based. For example, the constraints may be of the form that labels $L_i$ and $L_j$ cannot be adjacent (i.e., have zero probability of being neighbours). Alternatively, if $L_i$ and $L_j$ are adjacent, the boundary is likely to have certain characteristics (eg., fractal dimension), and the value of the constraint might depend on a measurement.

Equations 4 to 6 show that minimising the Gibbs energy $U(\omega)$ for a configuration is equivalent to maximising its probability density function. The preferred embodiment of the invention seeks to find an optimum region label configuration given measurements obtained from the frame M, prior knowledge about the labels K and the prior probabilities of the labels in the application domain Pr. The prior probabilities of the labels are biased by an analysis of contextual information. The problem of optimising the labels over the entire RAG (of the frame) can be solved by iteratively optimising the label at any site, i. The dependence of the label at region i on M, K and Pr is incorporated into the designed clique functions $V_c(\omega)$. Therefore the conditional probability density function for $X_i$ being $x_i$ at site i can be written as:



$$P(X_i = x_i \mid X_j = x_j,\ R_j \in n(R_i),\ M, K, Pr) = \frac{1}{Z_i} \exp\left[-\frac{1}{T} \sum_{c \in C_i} V_c(\omega)\right], \qquad Z_i = \sum_{x \in L} \exp\left[-\frac{1}{T} \sum_{c \in C_i} V_c(\omega^x)\right], \qquad (7)$$

where $C_i$ is the subset of C that consists of cliques that contain $X_i$ and $\omega^x$ denotes the configuration which is x at site i and agrees with $\omega$ everywhere else. The prior probabilities of the labels can also be used to bias the initial labels of the sites. For example, labels of previous analysis events could be used to initialise a graph for a later analysis event.

As mentioned above, clique functions can be based on feature measurements from the frame M, prior knowledge about the labels K, and prior probabilities of the labels Pr. Consider, for example, the label "sky" in an Outdoor Scenes application domain. The set of cliques involving region (site) i on the RAG (i.e., $C_i$) would typically consist of a unary clique consisting of just region i and a set of cliques that involve groups of regions, each including region i, in which each region is a neighbour of each other region in the group.

The unary clique function could be calculated by measuring a collection of features for the region i and then using these feature measurements as input to a neural network that has been previously trained using examples of sky regions from manually segmented images. Examples of possible features which could be measured for a region include mean R, G and/or B values, mean luminance, variance of the luminance in the region, texture features which may involve measurements derived in the frequency domain, and region shape features such as compactness. The neural network would typically be trained to generate a low value (eg., zero) for regions that have feature measurements that resemble those of the manually segmented sky regions and a high value (eg., 1.0) for those regions that have feature measurements which are very dissimilar to those of the manually segmented regions.

Feature measurements can also be used in clique functions which involve more than one region. For example, the tortuosity of a common boundary between two regions could be used in a clique function involving a pair of regions. For example, the common boundary between a "sky" and a "water" region would typically not be very tortuous whereas the common boundary between "foliage" and "sky" could well be very tortuous.

Prior knowledge can be incorporated into the clique functions in the form of constraints. For example, a clique function involving a "sky" label and a "grass" label might return a high energy value (eg., 1.0) if the region to which the "grass" label is being applied is above the region to which the "sky" label is being applied. In other words, we are using our prior knowledge of the fact that the "sky" regions are usually located above the "grass" regions in frames.

The prior probability of region i being "sky", $Pr_{sky}$, could also be incorporated into clique functions. One method of doing this would be to multiply an existing unary clique function by a factor such as:

[multiplying factor not legible in the source] (8)

where the factor includes a parameter in the range (0,1) that weights the contribution of the prior probability to the overall clique function. Prior probabilities could also be incorporated into clique functions involving more than one region. In this case, the multiplying factor for the clique function would typically involve the prior probabilities of each of the labels in the clique function.
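
As a rough illustration of the two kinds of clique function discussed above, the following Python sketch defines a unary energy for the "sky" label and a pairwise constraint between "sky" and "grass". The `Region` fields, the `mean_blue` feature (a stand-in for the output of a trained neural network), the parameter `alpha` and the form of the prior weighting are all assumptions made for this example. In particular, the weighting shown here is only one plausible choice and is not a reconstruction of Equation 8.

```python
from dataclasses import dataclass


@dataclass
class Region:
    label: str
    centroid_y: float  # image row of the region centroid; smaller means higher in the frame
    mean_blue: float   # hypothetical feature in [0, 1], standing in for a trained classifier


def unary_sky_energy(region: Region, prior_sky: float, alpha: float = 0.5) -> float:
    """Low energy when the region resembles sky; a larger prior lowers it further."""
    base = 1.0 - region.mean_blue  # 0.0 = very sky-like, 1.0 = not sky-like
    # Assumed prior weighting (illustrative only, not Equation 8).
    return base * (1.0 - alpha * prior_sky)


def pairwise_sky_grass_energy(a: Region, b: Region) -> float:
    """High energy (1.0) if a 'grass' region lies above a 'sky' region, else 0.0."""
    if {a.label, b.label} != {"sky", "grass"}:
        return 0.0
    sky, grass = (a, b) if a.label == "sky" else (b, a)
    return 1.0 if grass.centroid_y < sky.centroid_y else 0.0
```

Keeping each clique function as a small pure function of the regions involved means the same functions can be summed for the whole graph or only for the cliques that contain one site.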

Equation 7 demonstrates that selecting the most probable label at a site is equivalent to minimising the Gibbs energy function $U(\omega)$ at the site, weighted by the prior probability of the label. The optimum region label configuration for the frame can be obtained by iteratively visiting each of the N sites on the graph G and updating the label at each site. There exist several methods by which the region labels can be updated. A new label can be selected for a region from either a uniform distribution of the labels or from the conditional probability distribution of the MRF (i.e., the Gibbs Sampler, see Geman and Geman, IEEE Trans. Pattern Analysis and Machine Intelligence, 6, pp. 721-741, 1984). If more rapid convergence is desirable, then the iterated conditional modes method (described by Besag, J., J. R. Statistical Soc. B, 48, pp. 259-302, 1986) may be used. In the latter method, sites on the RAG are iteratively visited and, at each site, the label of the region is updated to be the label that has the largest conditional probability. The iterative procedure of visiting and updating sites can be implemented within a simulated annealing scheme (where the temperature is gradually decreased). The method of updating is not critical for this embodiment of the invention. Instead, it is the inclusion of the prior probability in the calculation of the Gibbs energy $U(\omega)$ that is critical.
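
The iterated conditional modes update described above can be sketched in Python as follows. This is a minimal illustration, assuming that only unary and pairwise cliques are used and that the clique energies are supplied as plain functions; the function names, the toy three-region graph and its energy values are all hypothetical. The helper `conditional_probabilities` evaluates the local Gibbs distribution at one site, which is what a Gibbs Sampler would draw from instead of taking the minimum.

```python
import math


def local_energy(site, label, labels, neighbours, unary, pairwise):
    """Energy of the cliques containing `site` if it were given `label`."""
    energy = unary(site, label)
    for other in neighbours[site]:
        energy += pairwise(label, labels[other])
    return energy


def conditional_probabilities(site, labels, label_set, neighbours,
                              unary, pairwise, temperature=1.0):
    """Local Gibbs distribution over `label_set` at `site`."""
    energies = [local_energy(site, l, labels, neighbours, unary, pairwise)
                for l in label_set]
    lowest = min(energies)  # subtracting the minimum avoids overflow in exp
    weights = [math.exp(-(e - lowest) / temperature) for e in energies]
    z = sum(weights)
    return {l: w / z for l, w in zip(label_set, weights)}


def icm(initial_labels, label_set, neighbours, unary, pairwise, max_sweeps=10):
    """Visit every site in turn and assign it the lowest-energy label."""
    labels = dict(initial_labels)
    for _ in range(max_sweeps):
        changed = False
        for site in sorted(labels):
            best = min(label_set, key=lambda l: local_energy(
                site, l, labels, neighbours, unary, pairwise))
            if best != labels[site]:
                labels[site] = best
                changed = True
        if not changed:  # a full sweep with no change: converged
            break
    return labels


# Toy RAG (hypothetical values): region 1 touches regions 2 and 3.
LABELS = ["sky", "water"]
NEIGHBOURS = {1: [2, 3], 2: [1], 3: [1]}
UNARY_COST = {(1, "sky"): 0.1, (1, "water"): 0.9,
              (2, "sky"): 0.6, (2, "water"): 0.4,
              (3, "sky"): 0.2, (3, "water"): 0.8}


def unary(site, label):
    return UNARY_COST[(site, label)]


def pairwise(label_a, label_b):
    return 0.0 if label_a == label_b else 0.5  # penalise differing neighbours


result = icm({1: "water", 2: "water", 3: "water"},
             LABELS, NEIGHBOURS, unary, pairwise)
```

Because the label with the lowest local energy is also the label with the largest conditional probability, the `min` over energies in `icm` selects the same label as an `argmax` over `conditional_probabilities` would.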

6. Contextual Analyser 

The contextual analyser 410 in FIG. 4 takes the current frame and a ROI 400 for each of the information sources (eg., video signal 110A and audio signal 110B) and provides information to the adjusting unit 430 on how the prior probabilities of the labels in the application domain 440 should be biased. The function of the contextual analyser 410 as depicted in FIG. 5 has already been discussed in association with the frame event analyser 202 of FIG. 2. A method of adjusting the prior probabilities of labels in an application domain 440 based on the presence of various key words/phrases in the audio signal ROI will be described in more detail hereinafter. Similar methods can be used for other contextual information.

Each label can be associated with one or more evidence 
units, where an evidence unit comprises a key word or 
phrase and a weight factor between 0 and 1. For example, an 
evidence unit for the label "water" might consist of the key 
word "beach" and a weighting factor of 0.8. The value of the 
weighting factor implies the likelihood that the existence of 
the key word in the audio ROI indicates that "water" is the 
appropriate label for at least one region in the RAG. 

Before evidence is collected, the prior probabilities of all labels should sum to 1.0. In other words:

$$\sum_{l=1}^{L} Pr_l = 1.0 \qquad (9)$$

As evidence is collected from the ROIs of the contextual information, evidence units are instantiated. The weight factors for the different instantiated evidence units for a given label l can be summed to generate the total evidence for the label, $E_l$.

The $Pr_l$ values for the labels in the application domain 440 can then be calculated using:

$$Pr_l = (1 + E_l)\,x, \qquad (10)$$

where the value of x is obtained by solving:

$$\sum_{l=1}^{L} (1 + E_l)\,x = 1.0. \qquad (11)$$

The resulting $Pr_l$ values can be used directly by the clique functions (see for example Equation 8).
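
The calculation in Equations 10 and 11 can be sketched in Python as follows. The function name, the representation of an evidence unit as a `(label, keyword, weight)` tuple and the example key words are assumptions made for illustration. Solving Equation 11 for x gives the reciprocal of the sum of $(1 + E_l)$ over all labels, so with no evidence every label receives the same prior.

```python
def adjust_priors(labels, evidence_units, keywords_found):
    """Return the biased prior Pr_l for each label from instantiated evidence units.

    evidence_units: iterable of (label, keyword, weight) tuples (hypothetical layout).
    keywords_found: set of key words/phrases detected in the audio ROI.
    """
    # E_l: sum of the weights of the evidence units whose key word was found.
    evidence = {label: 0.0 for label in labels}
    for label, keyword, weight in evidence_units:
        if keyword in keywords_found:
            evidence[label] += weight

    # Equation 11: sum over l of (1 + E_l) * x = 1.0, solved for x.
    x = 1.0 / sum(1.0 + e for e in evidence.values())

    # Equation 10: Pr_l = (1 + E_l) * x.
    return {label: (1.0 + e) * x for label, e in evidence.items()}


priors = adjust_priors(
    ["sky", "water", "grass"],
    [("water", "beach", 0.8), ("sky", "sunny", 0.5)],
    {"beach"},
)
```

In this example only the "beach" unit is instantiated, so "water" receives a larger prior than "sky" and "grass", and the three priors still sum to 1.0.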

7. Alternative Embodiments of the Invention

FIG. 11 shows the video segment analyser 140 according to an alternative embodiment of the invention in which the video segment analyser 140 is integrated with an object-based digital video coding system. In block 250, a first frame in a video segment 130 generated by the video segmenter 120 of FIG. 1 is loaded into the video segment analyser 140. The frame event analyser 252 receives the loaded frame and analyses the frame using the contextual information from the relevant ROI as described for the frame event analyser 202 in FIG. 2A, resulting in a labelled RAG. The labelled RAG is then output by the frame event analyser 252 to a region encoder 254 which encodes the RAG. The region encoder 254 encodes the regions of the RAG, including their adjacency and depth information and semantic labels, into a bitstream. In block 256, a check is made to determine if the end of the video segment has been reached. If the checking block 256 returns true (yes), then video segment processing terminates in block 258. Otherwise, if checking or decision block 256 returns false (no), the next frame in the video segment is loaded in block 260.

The motion detector 262 detects motion in the video segment on a frame-by-frame basis. It examines any motion, detected from the previous frame, on a region basis. If the motion of individual regions can be described by a motion model (eg., an affine transformation of the region), the model parameters are encoded in the bitstream in block 266. If the detected motion cannot be described by the motion model then the frame is analysed by the frame event analyser 252 and a new RAG is generated and encoded by the region encoder 254.
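
The control flow of FIG. 11 described above can be sketched in Python as follows. This is only an outline under stated assumptions: `analyse_frame` stands in for the frame event analyser 252 together with the region encoder 254, `fit_motion_model` stands in for the motion detector 262 and returns `None` when the motion cannot be described by the model, and the bitstream is represented as a simple list of tagged entries. None of these names or representations come from the specification.

```python
def encode_segment(frames, analyse_frame, fit_motion_model):
    """Encode one video segment as a list of ("rag", ...) and ("motion", ...) entries."""
    bitstream = []
    frames = iter(frames)

    first = next(frames, None)
    if first is None:          # empty segment: nothing to encode
        return bitstream

    # The first frame is always analysed and encoded as a labelled RAG.
    bitstream.append(("rag", analyse_frame(first)))
    previous = first

    for frame in frames:       # the loop ends when the segment ends
        params = fit_motion_model(previous, frame)
        if params is not None:
            # Region motion fits the model: encode only the model parameters.
            bitstream.append(("motion", params))
        else:
            # Motion not describable by the model: generate and encode a new RAG.
            bitstream.append(("rag", analyse_frame(frame)))
        previous = frame

    return bitstream


# Toy usage with integers standing in for frames (hypothetical stand-ins).
stream = encode_segment(
    [0, 1, 2, 10, 11],
    analyse_frame=lambda f: f"rag{f}",
    fit_motion_model=lambda prev, cur: (cur - prev) if cur - prev == 1 else None,
)
```

In the toy usage a jump of more than one between consecutive values plays the role of motion that the model cannot describe, which forces a new RAG to be generated.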

In the video segment analyser 140 depicted in FIG. 11, the semantic labels are preferably integrated with the coded digital video signal. If the video segment analyser is integrated with the digital video coding system, the regions may be separately coded in a resolution independent manner. This enables simple reconstruction of a digital video signal at any desired resolution. The method of encoding the digital video signal may be carried out using any of a number of such techniques well known to those skilled in the art. Clearly, the video segment analyser 140 does not necessarily need to be integrated with a digital video coding system. Instead, as noted hereinbefore, the video segment analyser 140 may just generate metadata. In such an embodiment, it may not be necessary to process all the video frames in a segment. In other words, only selected frames in a segment need be analysed. It is not the objective of the embodiments of this invention to specify how frames are selected as such selection depends to a large extent on the implementation. For example, a video interpretation system may need to work close to real-time.

Another alternative embodiment of the invention is one in which the video frame segmentation and region labelling process are combined in a single minimisation process.

What is claimed is:

1. A method of interpreting a digital video signal, wherein said digital video signal has contextual data, said method comprising the steps of:

segmenting said digital video signal into one or more video segments, each segment having one or more video frames, and each said video frame having a corresponding portion of said contextual data;

determining a plurality of regions for a video frame of a respective video segment;

processing said regions for each video segment to provide a region adjacency graph at one or more temporal instances in the respective video segment, said region adjacency graph representing adjacencies between regions for a corresponding frame of said respective video segment; and

analyzing said region adjacency graphs to produce a corresponding labeled region adjacency graph comprising at least one semantic label, said analysis being dependent upon a corresponding portion of said contextual data, wherein said labeled region adjacency graph represents an interpretation of said digital video signal.

2. The method according to claim 1, wherein said contextual data comprises information generated by one or more separate sources of said information.

3. The method according to claim 2, wherein the corresponding portion of said contextual data is obtained from a temporal region of interest for each source of contextual information, said temporal region of interest being relative to a video segment being analyzed.

4. The method according to claim 2, wherein said contextual data includes portions of the video signal.

5. The method according to claim 2, wherein said contextual data includes at least one data type selected from the group consisting of:

audio data;

electromagnetic data;

focal point data;

exposure data;

aperture data;

operator gaze location data;

environmental data;

time lapse or sequential image data;

motion data; and

textual tokens.

6. The method according to claim 5, wherein said textual tokens include textual annotation, phrases and keywords associated with at least a portion of a respective video segment.

7. The method according to claim 1, said analyzing step comprising the further sub-step of:

biasing a statistical or probabilistic interpretation model dependent upon said corresponding portion of contextual data.

8. The method according to claim 7, wherein said biasing step includes the step of selecting a predetermined application domain from a plurality of application domains dependent upon said corresponding portion of contextual data, said application domain comprising a set of semantic labels appropriate for use with said application domain.

9. The method according to claim 7, wherein said biasing step includes the step of adjusting at least one prior probability of a respective one of a plurality of semantic labels in at least one application domain.

10. The method according to claim 7, wherein each said analyzing step is dependent upon said biased statistical or probabilistic interpretation model.

11. The method according to claim 10, wherein each said analyzing step is dependent upon said at least one application domain.

12. The method according to claim 7, wherein each said analyzing step includes the step of analyzing regions of said region adjacency graph using said biased statistical or probabilistic interpretation model to provide said labeled region adjacency graph.

13. The method according to claim 12, wherein said region analysis step is dependent upon adjusted prior probabilities of a plurality of labels to provide said labelled region adjacency graph.

14. The method according to claim 7, wherein the statistical or probabilistic interpretation model is a Markov Random Field.

15. The method according to claim 1, further comprising the step of encoding said region adjacency graph to form a bitstream representation of said region adjacency graph.

16. The method according to claim 15, further comprising the step of:

representing motion between two successive video frames using a predetermined motion model comprising encoded parameters.

17. The method according to claim 16, further comprising the step of combining each encoded video segment and respective encoded region adjacency graph to provide a motion encoded digital video signal.

18. The method according to claim 1, further comprising the step of providing metadata associated with each video segment, wherein said metadata includes said region adjacency graphs of each video segment.

19. The method according to claim 1, wherein said analysis step is carried out on a video frame and a temporal region of interest of the respective contextual data.

20. The method according to claim 1, wherein said digital video signal is generated using a digital video recording device.

21. The method according to claim 20, wherein one or more portions of said contextual data is generated by one or more sensors of said digital video recording device.

22. An apparatus for interpreting a digital video signal, wherein said digital video signal has contextual data, said apparatus comprising:

means for segmenting said digital video signal into one or more video segments, each segment having one or more video frames, and each said video frame having a corresponding portion of said contextual data;

means for determining a plurality of regions for a video frame of a respective video segment;

means for processing said regions for each video segment to provide a region adjacency graph at one or more temporal instances in the respective video, said region adjacency graph representing adjacencies between regions for a corresponding frame of said respective video segment; and

means for analyzing said region adjacency graphs to produce a corresponding labeled region adjacency graph comprising at least one semantic label, said analysis being dependent upon a corresponding portion of said contextual data, wherein said labeled region adjacency graph represents an interpretation of said digital video signal.

23. The apparatus according to claim 22, further comprising means for biasing a statistical or probabilistic interpretation model dependent upon said corresponding portion of contextual data.

24. The apparatus according to claim 23, wherein said biasing means includes means for selecting a predetermined application domain from a plurality of application domains dependent upon said corresponding portion of contextual data, said application domain comprising a set of semantic labels appropriate for use with said application domain.

25. The apparatus according to claim 23, wherein said biasing means includes means for adjusting at least one prior probability of a respective one of a plurality of semantic labels in at least one application domain.

26. The apparatus according to claim 23, wherein said analysing means utilises said biased statistical or probabilistic interpretation model.

27. The apparatus according to claim 26, wherein said analysing means includes at least one application domain.

28. The apparatus according to claim 23, wherein said analysing means includes means for analysing regions of a region adjacency graph using said biased statistical or probabilistic interpretation model to provide a labelled region adjacency graph.

29. The apparatus according to claim 28, wherein said region analysis means uses adjusted prior probabilities of a plurality of labels to provide said labelled region adjacency graph.

30. The apparatus according to claim 22, wherein said digital video signal is generated using a digital video recording device.

31. The apparatus according to claim 30, wherein one or more portions of said contextual data is generated by one or more sensors of said digital video recording device.

32. A computer program stored in a computer readable medium, said computer program being configured for interpreting a digital video signal, wherein said digital video signal has contextual data, said computer program comprising:

code for segmenting said digital video signal into one or more video segments, each segment having one or more video frames, and each said video frame having a corresponding portion of said contextual data;

code for determining a plurality of regions for at least one video frame of a respective video segment;

code for processing said regions for each video segment to provide a region adjacency graph at one or more temporal instances in the respective video segment, said region adjacency graph representing adjacencies between regions for a corresponding frame of said respective video segment; and

code for analyzing said region adjacency graphs to produce a corresponding labeled region adjacency graph comprising at least one semantic label, said analysis being dependent upon a corresponding portion of said contextual data, wherein said labeled region adjacency graph represents an interpretation of said digital video signal.

33. The computer program according to claim 32, further 
comprising code for biasing a statistical or probabilistic 
interpretation model dependent upon said corresponding 
portion of contextual data. 

34. The computer program according to claim 33, wherein 
said biasing code includes code for selecting a predeter- 
mined application domain from a plurality of application 
domains dependent upon said corresponding portion of 
contextual data, said application domain comprising a set of 
semantic labels appropriate for use with said application 
domain.

35. The computer program according to claim 33, wherein 
said biasing code includes code for adjusting at least one 
prior probability of a respective one of a plurality of seman- 
tic labels in at least one application domain. 

36. The computer program according to claim 33, wherein 
said analyzing code utilizes said biased statistical or proba- 
bilistic interpretation model. 

37. The computer program according to claim 36, wherein 
said analyzing code includes at least one application 
domain. 

38. The computer program according to claim 33, wherein 
said analyzing code includes code for analyzing regions of 
a region adjacency graph using said biased statistical or 
probabilistic interpretation model to provide a labeled region 
adjacency graph. 

39. The computer program according to claim 38, wherein 
said region analysis code uses adjusted prior probabilities of 
a plurality of labels to provide said labeled region adjacency 
graph. 

40. The computer program according to claim 32, wherein 
said digital video signal is generated using a digital video
recording device. 

41. The computer program according to claim 40, wherein 
one or more portions of said contextual data is generated by 
one or more sensors of said digital video recording device. 

42. A method of interpreting a digital video signal,
wherein said digital video signal has contextual data, said
method comprising the steps of:

segmenting said digital video signal into one or more 
video segments, each segment having one or more 
video frames and a corresponding portion of said 
contextual data; 

determining a plurality of regions for at least one video 
frame of a respective video segment; 

processing said regions for each video segment to provide 
a region adjacency graph at one or more temporal 
instances in the respective video segment dependent
upon said corresponding portion of said contextual 
data, said region adjacency graph representing adjacen- 
cies between regions for a corresponding frame of said 
respective video segment; and 
analyzing said region adjacency graphs to interpret said 
digital video signal. 

43. An apparatus for interpreting a digital video signal, 
wherein said digital video signal has contextual data, said 
apparatus comprising: 

means for segmenting said digital video signal into one or 
more video segments, each segment having one or 
more video frames and a corresponding portion of said 
contextual data; 

means for determining a plurality of regions for at least 
one video frame of a respective video segment; 

means for processing said regions for each video segment 
to provide a region adjacency graph at one or more 
temporal instances in the respective video segment 
dependent upon said corresponding portion of said 
contextual data, said region adjacency graph represent- 
ing adjacencies between regions for a corresponding 
frame of said respective video segment; and 

means for analyzing said region adjacency graphs to 
interpret said digital video signal. 

44. A computer program stored in a computer readable 
medium, said computer program being configured for inter- 
preting a digital video signal, wherein said digital video 
signal has contextual data, said computer program compris- 
ing: 

code for segmenting said digital video signal into one or 
more video segments, each segment having one or 
more video frames and a corresponding portion of said 
contextual data; 

code for determining a plurality of regions for at least one 
video frame of a respective video segment; 

code for processing said regions for each video segment 
to provide a region adjacency graph at one or more 
temporal instances in the respective video segment 
dependent upon said corresponding portion of said 
contextual data, said region adjacency graph represent- 
ing adjacencies between regions for a corresponding 
frame of said respective video segment; and 

code for analyzing said region adjacency graphs to inter- 
pret said digital video signal. 